PSCI 8357 - STAT II
Department of Political Science, Vanderbilt University
January 12, 2026
Examples:
Incumbency advantage:
What would have been the election outcome if the candidate had not been an incumbent?
Democratic peace:
Would the two countries have fought each other if they had both been autocratic?
Policy intervention:
How many more disadvantaged youths would get employed under the new job training program?
DEFINITION: Treatment
\(T_i\): Indicator of treatment intake for unit \(i\), where \(i = 1, ..., N\)
\[ T_i = \begin{cases} 1 & \text{if unit } i \text{ received the treatment} \\ 0 & \text{otherwise} \end{cases} \]
DEFINITION: Observed Outcome
\(Y_i\): Variable of interest whose value may be affected by the treatment
DEFINITION: Potential Outcome
\(Y_{i} (t)\): Value of the outcome that would be realized if unit \(i\) received the treatment \(t\), where \(t \in \{ 0, 1\}\)
\[ Y_{i} (t) = \begin{cases} Y_{i} (1) & \text{Potential outcome for unit } i \text{ under treatment} \\ Y_{i} (0) & \text{Potential outcome for unit } i \text{ under no treatment} \end{cases} \]
DEFINITION: Unit Treatment Effect
The causal effect of the treatment on the outcome for unit \(i\) is the difference between its two potential outcomes:
\[ \tau_i = Y_{i} (1) - Y_{i} (0) \]
\[ Y_i = \begin{cases} Y_{i} (1) & \text{if } T_i=1 \\ Y_{i} (0) & \text{if } T_i=0 \end{cases} \]
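The switching equation can be made concrete in code. A minimal Python sketch with hypothetical potential outcomes (the "science table" below is never observable in practice):

```python
# Each unit reveals only one of its two potential outcomes.
# Hypothetical "science table" (the full truth, never observed):
science = [
    # (T_i, Y_i(1), Y_i(0))
    (1, 3, 0),
    (1, 1, 1),
    (0, 1, 0),
    (0, 1, 1),
]

# Switching equation: Y_i = T_i * Y_i(1) + (1 - T_i) * Y_i(0)
observed = [(t, t * y1 + (1 - t) * y0) for t, y1, y0 in science]

# tau_i = Y_i(1) - Y_i(0) requires BOTH potential outcomes,
# so it can never be computed from `observed` alone.
tau = [y1 - y0 for _, y1, y0 in science]
```

Note that `observed` retains only one outcome per unit, which is exactly why `tau` is unidentified without further assumptions.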
Fundamental Problem of Causal Inference (Holland 1986):
Without assumptions, it is in general impossible to learn about causal effects: we only ever observe one potential outcome per unit, so causal inference is fundamentally a missing data problem.
One “heroic solution” is to assume unit homogeneity
If \(Y_{i} (1)\) and \(Y_{i} (0)\) are constant across individual units, then cross-sectional comparisons will recover \(\tau = \tau_i\)
If \(Y_{i} (1)\) and \(Y_{i} (0)\) are constant across time, then before-and-after comparisons will recover \(\tau = \tau_i\)
DEFINITION: Average Treatment Effect (ATE)
\[ \begin{align*} \tau_{ATE} &= \frac{1}{N}\sum_{i=1}^N \left\{Y_{i}(1) - Y_{i}(0) \right\} &&\textit{(finite-population)}\\ \tau_{ATE} &= {\mathbb{E}}[Y_{i}(1) - Y_{i}(0)] &&\textit{(super-population)} \end{align*} \]
Example: The average effect of a GOTV mailer on voter turnout.
Note that \(\tau_{ATE}\) is still unidentified.
In the rest of this course, we will consider various assumptions under which \(\tau_{ATE}\) can be identified from observed information
DEFINITION: Average Treatment Effect on the Treated (ATT)
Let \(N_1 \equiv \sum_{i=1}^N T_i\), then
\[ \begin{align*} \tau_{ATT} &= \frac{1}{N_1}\sum_{i=1}^N T_i\left\{Y_{i} (1) - Y_{i} (0) \right\} &&\textit{(finite-population)}\\ \tau_{ATT} &= {\mathbb{E}}[Y_{i}(1) - Y_{i}(0) \mid T_i = 1] &&\textit{(super-population)} \end{align*} \]
DEFINITION: Conditional Average Treatment Effect (CATE)
\[ \tau_{CATE}(x) = {\mathbb{E}}[ Y_{i}(1) - Y_{i}(0) {\:\vert\:}X_i = x] \]
| \(i\) | \(T_i\) | \(Y_i\) | \(Y_{i}(1)\) | \(Y_{i}(0)\) | \(\tau_i\) |
|---|---|---|---|---|---|
| 1 | 1 | 3 | 3 | ? | ? |
| 2 | 1 | 1 | 1 | ? | ? |
| 3 | 0 | 0 | ? | 0 | ? |
| 4 | 0 | 1 | ? | 1 | ? |
| \({\mathbb{E}}[Y_{i}(1) {\:\vert\:}T_i = 1]\) | | | 2 | | |
| \({\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i = 0]\) | | | | 0.5 | |
| \(i\) | \(T_i\) | \(Y_i\) | \(Y_{i}(1)\) | \(Y_{i}(0)\) | \(\tau_i\) |
|---|---|---|---|---|---|
| 1 | 1 | 3 | 3 | 0 | 3 |
| 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| \({\mathbb{E}}[Y_{i}(1)]\) | | | 1.5 | | |
| \({\mathbb{E}}[Y_{i}(0)]\) | | | | 0.5 | |
| \(i\) | \(T_i\) | \(Y_i\) | \(Y_{i}(1)\) | \(Y_{i}(0)\) | \(\tau_i\) |
|---|---|---|---|---|---|
| 1 | 1 | 3 | 3 | 0 | 3 |
| 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| \({\mathbb{E}}[Y_{i}(1) {\:\vert\:}T_i = 1]\) | | | 2 | | |
| \({\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i = 1]\) | | | | 0.5 | |
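These quantities can be verified mechanically. A short Python check using the numbers from the table above:

```python
# Four-unit example from the table above.
T  = [1, 1, 0, 0]
Y1 = [3, 1, 1, 1]   # Y_i(1)
Y0 = [0, 1, 0, 1]   # Y_i(0)
Y  = [y1 if t else y0 for t, y1, y0 in zip(T, Y1, Y0)]  # observed outcome

def mean(xs):
    return sum(xs) / len(xs)

ate   = mean([y1 - y0 for y1, y0 in zip(Y1, Y0)])             # E[tau_i] = 1
att   = mean([y1 - y0 for t, y1, y0 in zip(T, Y1, Y0) if t])  # E[tau_i | T=1] = 1.5
naive = (mean([y for t, y in zip(T, Y) if t])
         - mean([y for t, y in zip(T, Y) if not t]))          # diff in means = 1.5
```

Here the naive difference in means equals the ATT (the selection bias with respect to \(Y_i(0)\) happens to be zero) but overstates the ATE, because the treated units have larger gains.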
ASSUMPTION: SUTVA
\[ Y_{i} (\mathbf{t}) = Y_{i} (\mathbf{t^{\prime}}) \quad \text{if } t_{i} = t_{i}^{\prime} \]
SUTVA consists of two sub-assumptions:
No interference: Potential outcomes for a unit must not be affected by treatment for any other units. Violations: spillover effects, contagion, dilution, displacement, communication
Consistency: Nominally identical treatments are in fact identical. Violations: variable levels of treatment, technical errors (e.g., unevenly applied fertilizer in a plot-yield experiment)
\[ \begin{array}{cc} Y_{1}(\textcolor{#b16286}{(1,1)}) - Y_{1}(\textcolor{#8ec07c}{(0,0)}), &Y_{1}(\textcolor{#b16286}{(1,1)}) - Y_{1}(\textcolor{#d65d0e}{(0,1)}), \\ Y_{1}(\textcolor{#b16286}{(1,1)}) - Y_{1}(\textcolor{#928374}{(1,0)}), &Y_{1}(\textcolor{#928374}{(1,0)}) - Y_{1}(\textcolor{#8ec07c}{(0,0)}),\\ Y_{1}(\textcolor{#d65d0e}{(0,1)}) - Y_{1}(\textcolor{#8ec07c}{(0,0)}). &\\ \end{array} \]
Without SUTVA, causal inference becomes exponentially more difficult as \(N\) increases: each unit has \(2^N\) potential outcomes rather than 2.
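The \(2^N\) count comes from enumerating all possible assignment vectors; a trivial check:

```python
from itertools import product

# Without SUTVA, unit i's outcome may depend on the entire assignment
# vector t = (t_1, ..., t_N), so unit i has one potential outcome per
# vector: 2^N in total. Under SUTVA this collapses to just 2.
def n_potential_outcomes(N):
    return sum(1 for _ in product([0, 1], repeat=N))

counts = {N: n_potential_outcomes(N) for N in (1, 2, 3, 10)}
```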
\[ \begin{align*} \hat\tau &= {\mathbb{E}}[Y_i {\:\vert\:}T_i=1]-{\mathbb{E}}[Y_i {\:\vert\:}T_i=0] &&\\ &= {\mathbb{E}}[Y_{i} (1) {\:\vert\:}T_i=1]-{\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i=0] \quad \text{($\because$ switching equation)}\\ &= \underbrace{{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i=1]}_{\tau_{ATT}} + \underbrace{{\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i=1]-{\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i=0]}_{\text{Selection bias}} \quad \text{($\because \pm {\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i=1]$)} \end{align*} \]
Example: Church attendance and turnout
Example: Job training program for the disadvantaged
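A simulation in the spirit of the job-training example makes the decomposition concrete (all functional forms and parameters here are hypothetical):

```python
import random

random.seed(0)

# Units with higher baseline outcomes Y_i(0) select into treatment;
# the true unit effect is a constant tau = 2.
N = 100_000
Y0 = [random.gauss(0, 1) for _ in range(N)]
Y1 = [y0 + 2 for y0 in Y0]
T  = [1 if y0 + random.gauss(0, 1) > 0 else 0 for y0 in Y0]
Y  = [y1 if t else y0 for t, y1, y0 in zip(T, Y1, Y0)]

def mean(xs):
    return sum(xs) / len(xs)

naive = (mean([y for t, y in zip(T, Y) if t])
         - mean([y for t, y in zip(T, Y) if not t]))
att   = mean([y1 - y0 for t, y1, y0 in zip(T, Y1, Y0) if t])  # = 2
bias  = (mean([y0 for t, y0 in zip(T, Y0) if t])
         - mean([y0 for t, y0 in zip(T, Y0) if not t]))       # positive here
# naive = ATT + selection bias holds exactly in-sample,
# so the naive comparison overstates the effect.
```

Flipping the sign of the selection rule would make `bias` negative, as in the job-training case where the disadvantaged select into treatment.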
\[ \begin{align*} \hat\tau &= {\mathbb{E}}[Y_i {\:\vert\:}T_i = 1]-{\mathbb{E}}[Y_i {\:\vert\:}T_i = 0] \\ &= \underbrace{{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i = 0]}_{\tau_{ATC}} +\underbrace{{\mathbb{E}}[Y_{i} (1) {\:\vert\:}T_i=1]-{\mathbb{E}}[Y_{i} (1) {\:\vert\:}T_i=0]}_{\text{Selection bias wrt $Y_i(1)$}} \end{align*} \]
\[ \begin{multline} {\mathbb{E}}[Y_i {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0] = \tau_{ATE} \\ + \underbrace{{\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i = 0]}_{\text{Selection bias wrt $Y_i(0)$}} + (1 - \pi)(\underbrace{{\mathbb{E}}[\tau_{i} {\:\vert\:}T_i = 1] - {\mathbb{E}}[\tau_{i} {\:\vert\:}T_i = 0]}_{\text{Selection bias wrt $\tau_i$}}), \\\text{where } \pi = {\textrm{Pr}}[T_i = 1]. \end{multline} \]
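This three-term decomposition can be verified numerically. A simulation with heterogeneous effects (all parameters hypothetical; \(\pi\) is taken as the in-sample treated share, for which the identity holds exactly):

```python
import random

random.seed(1)

# Heterogeneous effects: units with larger baselines AND larger gains
# select into treatment.
N = 200_000
Y0  = [random.gauss(0, 1) for _ in range(N)]
tau = [random.gauss(1, 1) for _ in range(N)]
T   = [1 if 0.5 * y0 + 0.5 * ti + random.gauss(0, 1) > 0 else 0
       for y0, ti in zip(Y0, tau)]
Y   = [y0 + t * ti for t, y0, ti in zip(T, Y0, tau)]

def mean(xs):
    return sum(xs) / len(xs)

pi       = mean(T)
naive    = (mean([y for t, y in zip(T, Y) if t])
            - mean([y for t, y in zip(T, Y) if not t]))
ate      = mean(tau)
bias_y0  = (mean([y0 for t, y0 in zip(T, Y0) if t])
            - mean([y0 for t, y0 in zip(T, Y0) if not t]))
bias_tau = (mean([ti for t, ti in zip(T, tau) if t])
            - mean([ti for t, ti in zip(T, tau) if not t]))

rhs = ate + bias_y0 + (1 - pi) * bias_tau  # equals naive exactly in-sample
```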
(Causal) Identification
Estimation and Inference (standard statistics)
\[ {\mathbb{E}}[ Y_{i} (0) {\:\vert\:}T_i = 1] = {\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i = 0] = {\mathbb{E}}[Y_{i} (0)] \implies \text{no selection bias} \]
\[ {\mathbb{E}}[Y_{i} (1) {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i = 1] = {\mathbb{E}}[Y_{i} (1) - Y_{i} (0)] \quad \text{($ATT$ is the same as $ATE$)} \]
\[ {\mathbb{E}}[ Y_{i} (1) - Y_{i} (0) {\:\vert\:}X_i = x] = \tau_{CATE}(x) \]
\[ {\mathbb{E}}[Y_{i} | T_i = 1, X_i = x] - {\mathbb{E}}[Y_{i} | T_i = 0, X_i = x] = \tau_{CATE} (x) \]
\[ \sum_{x \in \mathcal{X}} \tau_{CATE} (x) p(x) = \tau_{ATE} \]
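The aggregation identity (law of iterated expectations) can be checked on a toy population with a binary covariate (all numbers hypothetical):

```python
# (x_i, Y_i(1), Y_i(0)) for a toy population with binary X.
units = [
    (0, 2, 1), (0, 4, 1),
    (1, 5, 2), (1, 7, 2), (1, 9, 2), (1, 3, 2),
]

def mean(xs):
    return sum(xs) / len(xs)

def cate(x):
    return mean([y1 - y0 for xi, y1, y0 in units if xi == x])

# p(x): share of the population with X = x
p = {x: sum(1 for xi, _, _ in units if xi == x) / len(units) for x in (0, 1)}

ate_from_cates = sum(cate(x) * p[x] for x in (0, 1))  # sum_x CATE(x) p(x)
ate_direct     = mean([y1 - y0 for _, y1, y0 in units])
```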
Separation of Causal Estimands, Identification, and Estimation/Inference
Identification Strategies (Designs)
E.g., randomized experiments, conditional ignorability, absence of omitted variables
Or instrumental variables, regression discontinuity (RDD), difference-in-differences (DID)
Estimation Strategies
\[ \begin{align*} Y &= \alpha_0 + \alpha_1 Z + \alpha_2 X_1 + \alpha_3 X_3 + \epsilon_\alpha, \\ Z &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon_\beta,\\ &\;\;\vdots \end{align*} \]
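As a sketch of the regression-based estimation strategy, simulated data from equations of this form (coefficients are hypothetical, and numpy is assumed available), with \(\alpha_1\) as the coefficient of interest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from a linear system in the spirit of the equations above.
n = 50_000
X1 = rng.normal(size=n)
X3 = rng.normal(size=n)
Z  = 0.5 * X1 - 0.5 * X3 + rng.normal(size=n)
Y  = 1.0 + 2.0 * Z + 0.7 * X1 + 0.3 * X3 + rng.normal(size=n)

# OLS of Y on (1, Z, X1, X3) recovers alpha_1 = 2 if the model is correct.
design = np.column_stack([np.ones(n), Z, X1, X3])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
alpha1_hat = float(coef[1])
```

The point of the separation above is that this OLS step is an *estimation* choice; whether \(\alpha_1\) has a causal interpretation depends entirely on the identification strategy.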
A causal diagram is a directed acyclic graph (DAG) composed of:
Missing edges encode causal assumptions:
Missing arrows encode exclusion restrictions
Missing dashed arcs encode independencies between error terms
A causal DAG has a one-to-one relationship with a nonparametric structural equation model (NPSEM):
\[ \begin{align*} X &= f_X(U_X), \\ T &= f_T(X, U_T), \\ Y &= f_Y(T, X, U_Y) \end{align*} \]
These are structural equations (as opposed to algebraic) and represent causation; the equal signs are therefore directional, so terms cannot be moved from one side to the other
Treatments (interventions) are represented by the \(do()\) operator
\[ X = x_0, \quad T = f_T (x_0, U_T),\quad Y = f_Y(T, x_0, U_Y) \]
The pre-intervention distribution: \(p(y, t, x)\)
The post-intervention distribution: \(p(y, x {\:\vert\:}do(t_0))\)
\[ {\mathbb{E}}[Y {\:\vert\:}do(t_1)] - {\mathbb{E}}[Y {\:\vert\:}do(t_0)] \]
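The difference between conditioning and intervening can be simulated directly from the NPSEM above (functional forms and coefficients here are hypothetical):

```python
import random

random.seed(2)

N = 200_000

def draw(do_t=None):
    """One draw from the model; do_t overrides the equation for T."""
    x = random.gauss(0, 1)                            # X = f_X(U_X)
    t = do_t if do_t is not None else int(x + random.gauss(0, 1) > 0)
    y = 2 * t + 3 * x + random.gauss(0, 1)            # Y = f_Y(T, X, U_Y)
    return t, y

def mean(xs):
    return sum(xs) / len(xs)

obs = [draw() for _ in range(N)]
conditioning = (mean([y for t, y in obs if t == 1])
                - mean([y for t, y in obs if t == 0]))  # confounded by X

intervening = (mean([draw(do_t=1)[1] for _ in range(N)])
               - mean([draw(do_t=0)[1] for _ in range(N)]))  # ~ 2, true effect
```

`conditioning` computes \({\mathbb{E}}[Y {\:\vert\:}t_1] - {\mathbb{E}}[Y {\:\vert\:}t_0]\) from the pre-intervention distribution and is badly biased; `intervening` simulates the mutilated model and recovers \({\mathbb{E}}[Y {\:\vert\:}do(t_1)] - {\mathbb{E}}[Y {\:\vert\:}do(t_0)]\).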
Example:
Admissions example: \(T\) is a student’s grades and \(Y\) is motivation, both influencing \(X\), the admission decision.
Conditioning on \(X\) (e.g., analyzing only admitted students) can create a misleading association between \(T\) and \(Y\), even though they are marginally independent.
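This collider bias is easy to reproduce by simulation (the selection rule below is hypothetical):

```python
import random

random.seed(3)

# Collider bias in the admissions example: grades (t) and motivation (y)
# are independent, but both raise the chance of admission (x).
N = 100_000
data = []
for _ in range(N):
    t = random.gauss(0, 1)            # grades
    y = random.gauss(0, 1)            # motivation
    x = 1 if t + y > 1 else 0         # admitted iff combined score is high
    data.append((t, y, x))

def corr(pairs):
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    cov = sum((a - ma) * (b - mb) for a, b in pairs) / n
    va = sum((a - ma) ** 2 for a, _ in pairs) / n
    vb = sum((b - mb) ** 2 for _, b in pairs) / n
    return cov / (va * vb) ** 0.5

r_all      = corr([(t, y) for t, y, _ in data])       # ~ 0
r_admitted = corr([(t, y) for t, y, x in data if x])  # strongly negative
```

Among admitted students, high grades predict low motivation (and vice versa), purely because admission required a high combined score.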
Example:
In Pearl’s lung-cancer example, \(T\) is smoking, \(X\) is seatbelt use, and \(Y\) is lung cancer; \(U_1\) (adherence to social norms) affects \(T\) and \(X\), while \(U_2\) (adherence to health norms) affects \(X\) and \(Y\).
Conditioning on \(X\) creates a spurious association between \(T\) and \(Y\), producing M-bias.
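A simulation of the M-structure shows the effect (functional forms and coefficients are hypothetical; the true effect of \(T\) on \(Y\) is zero):

```python
import random

random.seed(4)

# M-bias: U1 drives smoking (t) and seatbelt use (x); U2 drives
# seatbelt use (x) and lung cancer (y). t has NO effect on y.
N = 200_000
rows = []
for _ in range(N):
    u1 = random.gauss(0, 1)
    u2 = random.gauss(0, 1)
    t = u1 + random.gauss(0, 1)       # smoking
    x = 1 if u1 + u2 > 0 else 0       # seatbelt use (a collider)
    y = u2 + random.gauss(0, 1)       # lung cancer risk
    rows.append((t, y, x))

def corr(pairs):
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    cov = sum((a - ma) * (b - mb) for a, b in pairs) / n
    va = sum((a - ma) ** 2 for a, _ in pairs) / n
    vb = sum((b - mb) ** 2 for _, b in pairs) / n
    return cov / (va * vb) ** 0.5

r_marginal  = corr([(t, y) for t, y, _ in rows])       # ~ 0: no adjustment needed
r_within_x1 = corr([(t, y) for t, y, x in rows if x])  # negative: M-bias
```

Adjusting for the seemingly innocuous covariate \(X\) *introduces* bias that was absent in the unadjusted comparison.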
For example, the following NPSEM
\[ X = f_X(U_X), \quad T = f_T(X, U_T), \quad Y = f_Y(T, U_Y) \]
directly corresponds to the following potential outcomes: \(X_i\), \(T_{i} (X_i)\), and \(Y_{i} (T_i)\).
Because of this fundamental equivalence, we will mostly work with potential outcomes, currently the standard framework in social sciences.
Note: Graphs are useful for expressing and visualizing a causal model in empirical research.
Imbens and Rubin (2015):
Pearl’s work is interesting, and many researchers find his arguments that path diagrams are a natural and convenient way to express assumptions about causal structures appealing. In our own work, perhaps influenced by the type of examples arising in social and medical sciences, we have not found this approach to aid drawing of causal inferences.
Pearl’s blog post:
So, what is it about epidemiologists that drives them to seek the light of new tools, while economists seek comfort in partial blindness, while missing out on the causal revolution? Can economists do in their heads what epidemiologists observe in their graphs? Can they, for instance, identify the testable implications of their own assumptions? Can they decide whether the IV assumptions are satisfied in their own models of reality? Of course they can’t; such decisions are intractable to the graph-less mind.
Front-Door criteria (Pearl 1995a)
do-calculus (Pearl 1995b)
ID algorithm (Tian and Pearl 2002)
ID algorithm is sound and complete (Shpitser and Pearl 2006)
Causal DAGs are useful for understanding
dynamic causal effects (Hernán and Robins 2020)
interference (Ogburn and VanderWeele 2014)
external validity (Bareinboim and Pearl 2016)
mediation (Imai, Keele, and Tingley 2010)